*! version 5.0
* 13 August 2018
* NIDS

* THIS IS A FOOD AND NON-FOOD EXPENDITURE DO FILE: 5 OF 14

*=====================================================================================================================================
* GLOBALS FOR DATA FILES, DO FILES AND VERSION SUFFIXES

* DEFINED IN "W1 Food_NonFood Expenditure - Master  Food_NonFood Expenditure do file  (1 of 14).do"

*=====================================================================================================================================
* SETTING UP STATA TO RUN DO FILES

clear
cap clear matrix
set more off 

**********************************************************************
*					FOOD 2
*				Imputing for food items
**********************************************************************
*			PURE DEMOGRAPHICS (With Income) MODEL
**********************************************************************
use "$DataOUT\tempdata3.dta", clear

forvalues a=1/32{
gen total`a'lg=log(total`a')
}
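
*Optional check (a sketch, not part of the original routine): log() returns missing for zero or negative
*values, so any such amounts would silently drop out of the regressions below. Whether such values occur
*in this data set is an assumption to verify. The loop only counts and reports them; no data are changed.
forvalues a=1/32{
quietly count if total`a'<=0
display "Item `a': " r(N) " non-positive reported amounts (log is missing)"
}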

*Runs regression imputations (on log values) for each food item on income, dwelling size, province dummies, urban/rural dummy,
*household size, age of eldest member, matric-in-household dummy, race dummies, corrugated roof dummy, grant dummy and children dummy
forvalues a=1/32{
impute total`a'lg lgincome w1_h_dwlrms westerncape easterncape northerncape freestate kwazulunatal northwest gauteng mpumalanga  ///
urban hhsizer maxage fammatric  Asian White Coloured corru grants anychildren if e1_2_`a'==1,gen(total`a'lgimp)
}

*Generates a new consumption value which, for now, is the same as the reported consumption
forvalues a=1/32{
gen imptotal`a'= total`a' if e1_2_`a'==1
}

*Replaces instances of missing consumption with the values imputed earlier
forvalues a=1/32{
replace imptotal`a'= exp(total`a'lgimp) if e1_2_`a'==1&imptotal`a'==.
}
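
*Optional check (a sketch, not part of the original routine): reports, per food item, how many purchasing
*households still have no value after the regression-based fill above. Nothing is altered; the loop only
*counts and displays.
forvalues a=1/32{
quietly count if e1_2_`a'==1&imptotal`a'==.
display "Item `a': " r(N) " purchasing households still missing after regression imputation"
}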

**********************************************************************
*				CELL MEDIAN MODEL
**********************************************************************

*Generates an imputed value, initially equal to the original (raw) data
forvalues a=1/32{
gen fmedianimp`a'=total`a'
}
global rate "0.6"

*Replaces missing values with the PSU median for that item, provided the PSU has more than one response and its response rate exceeds the cut-off
forvalues a=1/32{
replace fmedianimp`a'= psumedian`a' if total`a'==.&psumedian`a'!=.&psucount`a'>1&e1_2_`a'==1&psufrate`a'>$rate
}

*Replaces values that are still missing with the district median for that item, provided the response rate within the district exceeds the cut-off
forvalues a=1/32{
replace fmedianimp`a'= dismedian`a' if fmedianimp`a'==.&dismedian`a'!=.&e1_2_`a'==1&disfrate`a'>$rate
}

*Generates the median of the food item within the province and replaces any values that are still missing with this provincial median
forvalues a=1/32{
quietly egen provmedian`a' = median(total`a'), by(province)
quietly replace fmedianimp`a'= provmedian`a' if fmedianimp`a'==.&provmedian`a'!=.&e1_2_`a'==1
}
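
*Optional check (a sketch, not part of the original routine): counts, per food item, how many missing
*amounts were filled by the PSU/district/province median cascade above. Rerunning after changing the
*global rate shows how sensitive these counts are to the cut-off. No data are changed.
forvalues a=1/32{
quietly count if total`a'==.&fmedianimp`a'!=.
display "Item `a': " r(N) " values filled by the cell-median method"
}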

**********************************************************************
*			Imputation Rates and Post Analysis
**********************************************************************
*This section derives the number of observations that were imputed, allows the user to compare the two imputation methods
*(regression and cell median), and supports adjusting the global cut-off rate used by the median method
**********************************************************************

*Generates a dummy variable equal to 1 if this food item has an imputed value but no original value (in other words, a dummy for imputation)
forvalues a=1/32{
gen total`a'imputed =1 if imptotal`a'!=.&total`a'==.
quietly replace total`a'imputed=0 if imptotal`a'==.|total`a'!=.
}

***Counts the number of Food Items that were imputed for each household
egen foodsubimp=rowtotal(total*imputed)

*Recodes the dummy variable to allow for counting
recode total*imputed (0=.)

*Generates variables counting the number of values observed (before imputation), the number of imputed values and finally the quotient
*of the two (imputed values as a proportion of observed values), for each food type

forvalues a=1/32{
egen small`a' = count(total`a')
egen impcount`a' =count(total`a'imputed)
gen noImputed`a' =impcount`a'/small`a'
}

*Recodes back after counting
recode total*imputed (.=0)

*Outputs the imputation rate for each food item
sum noImputed*
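
*Optional summary (a sketch, not part of the original routine): the distribution of foodsubimp shows how
*many households had 0, 1, 2, ... food items imputed by the regression method.
tabulate foodsubimp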

*--------------------------------------------------------------------------------------------------------------
***Analysis of values that were imputed

*Generates a variable containing only the consumption values that WERE imputed for each food type. This is done for both methods 
*of imputation (median and regression)
forvalues a=1/32{
quietly gen fmedianimp`a'meds= fmedianimp`a' if total`a'imputed==1
quietly gen imptotal`a'imps= imptotal`a' if total`a'imputed==1
}

egen ftotalmeds = rowtotal(fmedianimp*meds)
replace ftotalmeds =. if ftotalmeds ==0
egen ftotalimps = rowtotal(imptotal*imps)
replace ftotalimps =. if ftotalimps ==0

sum ftotalmeds ftotalimps, detail

gen lgftotalmeds  =log(ftotalmeds)
gen lgftotalimps =log(ftotalimps)
gen lgtotalfood =log(totalfood)

*twoway (kdensity lgftotalmeds  ) (kdensity lgftotalimps ), legend(order(1 "Expenditure Median" 2 "Expenditure Imputed"))

drop lgftotalmeds lgftotalimps lgtotalfood 

save "$DataOUT\tempdata4.dta", replace
*---------------------------------------------------------------------------------------------------------------------------
